1 Executive Summary

Our best model is model6.

It has the highest R^2 value, 0.306, meaning that 30.6% of the variance in the natural logarithm of price_4_nights is explained by the independent variables in the model. The factors that influence price_4_nights include the property type, which is split into the following categories: entire rental unit, private room in rental unit, private room in residential home, and other. All of these categories are statistically significant at the 5% level (95% confidence). The model also includes the review score rating, the number of bedrooms, the number of beds, the number of people the Airbnb can accommodate, and the number of bathrooms, all of which are also significant. The last factors that influence the price for 4 nights are the ability to book instantly and the availability. These are statistically significant at the same level but have a smaller impact on the price for 4 nights than the other variables.

2 Data Exploration and Feature Selection:

2.1 Data Collection:

In our final group assignment, we analysed data about Airbnb listings and fitted a model to predict the total cost for two people staying 4 nights in an AirBnB in a city. We downloaded the AirBnB data from insideairbnb.com; it was originally scraped from airbnb.com.

We have selected Buenos Aires to work on and we used the vroom::vroom() function to download the AirBnB listing data from the Google sheet provided by Professor Kostis.
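As a minimal sketch of the loading step, the snippet below reads a small CSV with vroom::vroom(). The temporary file and its two rows are hypothetical stand-ins, since the actual link to the course-provided sheet is not reproduced here; vroom() also accepts a URL in place of a local path.

```r
library(vroom)

# Hypothetical stand-in for the course-provided file: two toy rows in a temporary CSV
toy_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = c(6283, 11508), price = c("$4,930.00", "$6,408.00")),
  toy_path,
  row.names = FALSE
)

# vroom() guesses the delimiter and the column types on its own
toy_listings <- vroom(toy_path, show_col_types = FALSE)
```

Note that price is read as a character column because of the currency symbol, which is why it needs the pre-processing described later in this document.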

2.2 Understanding the Dataframe:

Before starting the analysis, we would like to introduce the dataframe to the reader of this document. There are many variables in the dataframe; below is a quick description of some of them, and you can find a data dictionary here.

  • price: cost per night

  • property_type: type of accommodation (House, Apartment, etc.)

  • room_type:

    • Entire home/apt (guests have entire place to themselves)
    • Private room (Guests have private room to sleep, all other rooms shared)
    • Shared room (Guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing

  • review_scores_rating: Average review score (0 - 5 in this dataset)

  • longitude , latitude: geographical coordinates to help us locate the listing

  • neighbourhood*: three variables on a few major neighbourhoods in each city

3 Exploratory Data Analysis (EDA)

3.1 Looking at Raw Values

Let’s first take a look at the raw dataframe.

#Allows us to see the various columns in a dataframe
glimpse(listings)
Rows: 18,438
Columns: 74
$ id                                           <dbl> 6283, 11508, 12463, 13095~
$ listing_url                                  <chr> "https://www.airbnb.com/r~
$ scrape_id                                    <dbl> 2.021093e+13, 2.021093e+1~
$ last_scraped                                 <date> 2021-09-29, 2021-09-28, ~
$ name                                         <chr> "Casa Al Sur", "Amazing L~
$ description                                  <chr> "<b>The space</b><br />Th~
$ neighborhood_overview                        <chr> NA, "AREA: PALERMO SOHO<b~
$ picture_url                                  <chr> "https://a0.muscache.com/~
$ host_id                                      <dbl> 13310, 42762, 48799, 5099~
$ host_url                                     <chr> "https://www.airbnb.com/u~
$ host_name                                    <chr> "Pamela", "Candela", "Mat~
$ host_since                                   <date> 2009-04-13, 2009-10-01, ~
$ host_location                                <chr> "New York, New York, Unit~
$ host_about                                   <chr> "I'm from Argentina but l~
$ host_response_time                           <chr> "N/A", "N/A", "N/A", "wit~
$ host_response_rate                           <chr> "N/A", "N/A", "N/A", "100~
$ host_acceptance_rate                         <chr> "N/A", "100%", "N/A", "N/~
$ host_is_superhost                            <lgl> FALSE, TRUE, FALSE, FALSE~
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/~
$ host_picture_url                             <chr> "https://a0.muscache.com/~
$ host_neighbourhood                           <chr> "Balvanera", "Palermo", "~
$ host_listings_count                          <dbl> 1, 1, 1, 7, 7, 7, 7, 7, 1~
$ host_total_listings_count                    <dbl> 1, 1, 1, 7, 7, 7, 7, 7, 1~
$ host_verifications                           <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified                       <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ neighbourhood                                <chr> NA, "Buenos Aires, Capita~
$ neighbourhood_cleansed                       <chr> "Balvanera", "Palermo", "~
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, N~
$ latitude                                     <dbl> -34.60523, -34.58184, -34~
$ longitude                                    <dbl> -58.41042, -58.42415, -58~
$ property_type                                <chr> "Entire rental unit", "En~
$ room_type                                    <chr> "Entire home/apt", "Entir~
$ accommodates                                 <dbl> 2, 2, 1, 2, 2, 2, 3, 4, 3~
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text                               <chr> "1 bath", "1 bath", "1 ba~
$ bedrooms                                     <dbl> NA, 1, 1, 1, 1, 1, 1, 1, ~
$ beds                                         <dbl> 1, 1, 1, 1, 2, 2, 3, 3, 1~
$ amenities                                    <chr> "[\"Pool\", \"Heating\", ~
$ price                                        <chr> "$4,930.00", "$6,408.00",~
$ minimum_nights                               <dbl> 3, 2, 1, 1, 1, 1, 1, 1, 5~
$ maximum_nights                               <dbl> 30, 1125, 4, 60, 60, 60, ~
$ minimum_minimum_nights                       <dbl> 3, 2, 1, 1, 1, 1, 1, 1, 5~
$ maximum_minimum_nights                       <dbl> 3, 2, 1, 1, 1, 1, 1, 1, 5~
$ minimum_maximum_nights                       <dbl> 30, 1125, 4, 60, 60, 60, ~
$ maximum_maximum_nights                       <dbl> 30, 1125, 4, 60, 60, 60, ~
$ minimum_nights_avg_ntm                       <dbl> 3.0, 2.0, 1.0, 1.0, 1.0, ~
$ maximum_nights_avg_ntm                       <dbl> 30, 1125, 4, 60, 60, 60, ~
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30                              <dbl> 30, 0, 30, 30, 30, 30, 30~
$ availability_60                              <dbl> 60, 0, 60, 60, 60, 60, 60~
$ availability_90                              <dbl> 90, 0, 90, 90, 90, 90, 90~
$ availability_365                             <dbl> 365, 148, 365, 365, 365, ~
$ calendar_last_scraped                        <date> 2021-09-29, 2021-09-28, ~
$ number_of_reviews                            <dbl> 1, 27, 20, 1, 0, 1, 0, 1,~
$ number_of_reviews_ltm                        <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0~
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review                                 <date> 2011-01-31, 2015-01-05, ~
$ last_review                                  <date> 2011-01-31, 2021-04-03, ~
$ review_scores_rating                         <dbl> 4.00, 4.74, 4.76, 5.00, N~
$ review_scores_accuracy                       <dbl> 5.00, 4.92, 4.81, 5.00, N~
$ review_scores_cleanliness                    <dbl> 4.00, 4.85, 4.88, 5.00, N~
$ review_scores_checkin                        <dbl> 5.00, 4.88, 4.88, 5.00, N~
$ review_scores_communication                  <dbl> 5.00, 4.96, 4.88, 5.00, N~
$ review_scores_location                       <dbl> 4.00, 4.92, 4.75, 5.00, N~
$ review_scores_value                          <dbl> 4.00, 4.96, 4.88, 5.00, N~
$ license                                      <lgl> NA, NA, NA, NA, NA, NA, N~
$ instant_bookable                             <lgl> FALSE, FALSE, FALSE, FALS~
$ calculated_host_listings_count               <dbl> 1, 1, 1, 7, 7, 7, 7, 7, 1~
$ calculated_host_listings_count_entire_homes  <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 7, 7, 7, 7, 7, 0~
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month                            <dbl> 0.01, 0.33, 0.17, 0.03, N~

3.2 Summary Statistics

Then, let’s take a look at the summary statistics of the dataframe.

3.2.1 Skimming the Dataframe

#getting summary statistics of dataframe
skimmed <- skim(listings)
#Using kable to make the table cleaner
kbl(skimmed) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace Date.min Date.max Date.median Date.n_unique logical.mean logical.count numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character listing_url 0 1.0000000 33 37 0 18438 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character name 5 0.9997288 1 244 0 17864 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character description 823 0.9553639 1 1000 0 16957 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character neighborhood_overview 7166 0.6113461 1 1000 0 9920 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character picture_url 0 1.0000000 60 126 0 17995 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_url 0 1.0000000 38 43 0 11412 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_name 5 0.9997288 1 34 0 3139 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_location 85 0.9953900 2 204 0 957 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_about 7875 0.5728929 1 4037 0 5547 24 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_response_time 5 0.9997288 3 18 0 5 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_response_rate 5 0.9997288 2 4 0 66 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_acceptance_rate 5 0.9997288 2 4 0 87 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_thumbnail_url 5 0.9997288 55 106 0 11326 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_picture_url 5 0.9997288 57 109 0 11326 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_neighbourhood 3214 0.8256861 4 33 0 105 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_verifications 0 1.0000000 2 180 0 343 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character neighbourhood 7166 0.6113461 9 93 0 739 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character neighbourhood_cleansed 0 1.0000000 4 17 0 48 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character property_type 0 1.0000000 3 35 0 69 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character room_type 0 1.0000000 10 15 0 4 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character bathrooms_text 74 0.9959865 6 17 0 44 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character amenities 0 1.0000000 2 1517 0 17064 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character price 0 1.0000000 5 13 0 2142 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Date last_scraped 0 1.0000000 NA NA NA NA NA 2021-09-28 2021-10-06 2021-09-29 4 NA NA NA NA NA NA NA NA NA NA
Date host_since 5 0.9997288 NA NA NA NA NA 2008-08-29 2021-09-27 2016-04-06 3389 NA NA NA NA NA NA NA NA NA NA
Date calendar_last_scraped 0 1.0000000 NA NA NA NA NA 2021-09-28 2021-10-06 2021-09-29 4 NA NA NA NA NA NA NA NA NA NA
Date first_review 5994 0.6749105 NA NA NA NA NA 2010-10-19 2021-09-27 2019-01-11 2699 NA NA NA NA NA NA NA NA NA NA
Date last_review 5994 0.6749105 NA NA NA NA NA 2010-12-30 2021-09-29 2019-12-06 2066 NA NA NA NA NA NA NA NA NA NA
logical host_is_superhost 5 0.9997288 NA NA NA NA NA NA NA NA NA 0.2540552 FAL: 13750, TRU: 4683 NA NA NA NA NA NA NA NA
logical host_has_profile_pic 5 0.9997288 NA NA NA NA NA NA NA NA NA 0.9952802 TRU: 18346, FAL: 87 NA NA NA NA NA NA NA NA
logical host_identity_verified 5 0.9997288 NA NA NA NA NA NA NA NA NA 0.6693973 TRU: 12339, FAL: 6094 NA NA NA NA NA NA NA NA
logical neighbourhood_group_cleansed 18438 0.0000000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical bathrooms 18438 0.0000000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical calendar_updated 18438 0.0000000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical has_availability 0 1.0000000 NA NA NA NA NA NA NA NA NA 0.9981560 TRU: 18404, FAL: 34 NA NA NA NA NA NA NA NA
logical license 18437 0.0000542 NA NA NA NA NA NA NA NA NA 0.0000000 FAL: 1 NA NA NA NA NA NA NA NA
logical instant_bookable 0 1.0000000 NA NA NA NA NA NA NA NA NA 0.3884912 FAL: 11275, TRU: 7163 NA NA NA NA NA NA NA NA
numeric id 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 2.903211e+07 1.446644e+07 6.283000e+03 1.866906e+07 3.195715e+07 4.034960e+07 5.251016e+07 ▃▃▅▇▅
numeric scrape_id 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 2.021093e+13 0.000000e+00 2.021093e+13 2.021093e+13 2.021093e+13 2.021093e+13 2.021093e+13 ▁▁▇▁▁
numeric host_id 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 1.094351e+08 1.091353e+08 2.616000e+03 1.382351e+07 6.577595e+07 1.912728e+08 4.248618e+08 ▇▂▂▂▁
numeric host_listings_count 5 0.9997288 NA NA NA NA NA NA NA NA NA NA NA 8.490696e+00 1.921978e+01 0.000000e+00 1.000000e+00 2.000000e+00 5.000000e+00 1.800000e+02 ▇▁▁▁▁
numeric host_total_listings_count 5 0.9997288 NA NA NA NA NA NA NA NA NA NA NA 8.490696e+00 1.921978e+01 0.000000e+00 1.000000e+00 2.000000e+00 5.000000e+00 1.800000e+02 ▇▁▁▁▁
numeric latitude 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA -3.459238e+01 1.796680e-02 -3.468962e+01 -3.460294e+01 -3.459159e+01 -3.458205e+01 -3.453498e+01 ▁▁▅▇▁
numeric longitude 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA -5.841514e+01 2.948310e-02 -5.853093e+01 -5.843450e+01 -5.841437e+01 -5.839106e+01 -5.835541e+01 ▁▁▆▇▅
numeric accommodates 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 2.793470e+00 1.529581e+00 0.000000e+00 2.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01 ▇▂▁▁▁
numeric bedrooms 2820 0.8470550 NA NA NA NA NA NA NA NA NA NA NA 1.351774e+00 9.335122e-01 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 4.100000e+01 ▇▁▁▁▁
numeric beds 208 0.9887189 NA NA NA NA NA NA NA NA NA NA NA 1.901481e+00 1.786956e+00 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 9.000000e+01 ▇▁▁▁▁
numeric minimum_nights 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 7.136891e+00 2.102563e+01 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 7.300000e+02 ▇▁▁▁▁
numeric maximum_nights 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 6.638437e+02 9.239233e+02 1.000000e+00 9.000000e+01 1.125000e+03 1.125000e+03 9.999900e+04 ▇▁▁▁▁
numeric minimum_minimum_nights 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 7.044582e+00 2.078429e+01 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 7.300000e+02 ▇▁▁▁▁
numeric maximum_minimum_nights 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 7.197364e+00 2.088650e+01 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 7.300000e+02 ▇▁▁▁▁
numeric minimum_maximum_nights 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 8.064373e+02 5.331394e+02 1.000000e+00 1.800000e+02 1.125000e+03 1.125000e+03 3.000000e+04 ▇▁▁▁▁
numeric maximum_maximum_nights 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 4.666902e+05 3.162769e+07 1.000000e+00 1.820000e+02 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
numeric minimum_nights_avg_ntm 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 7.116434e+00 2.080876e+01 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 7.300000e+02 ▇▁▁▁▁
numeric maximum_nights_avg_ntm 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 3.408627e+05 2.308562e+07 1.000000e+00 1.810000e+02 1.125000e+03 1.125000e+03 1.573841e+09 ▇▁▁▁▁
numeric availability_30 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 2.002300e+01 1.232825e+01 0.000000e+00 6.000000e+00 2.800000e+01 3.000000e+01 3.000000e+01 ▃▁▁▁▇
numeric availability_60 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 4.308358e+01 2.316322e+01 0.000000e+00 2.700000e+01 5.800000e+01 6.000000e+01 6.000000e+01 ▂▁▁▁▇
numeric availability_90 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 6.745482e+01 3.260183e+01 0.000000e+00 5.600000e+01 8.800000e+01 9.000000e+01 9.000000e+01 ▂▁▁▁▇
numeric availability_365 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 2.355917e+02 1.248700e+02 0.000000e+00 1.030000e+02 2.690000e+02 3.640000e+02 3.650000e+02 ▂▃▃▂▇
numeric number_of_reviews 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 1.612469e+01 3.375891e+01 0.000000e+00 0.000000e+00 3.000000e+00 1.600000e+01 5.040000e+02 ▇▁▁▁▁
numeric number_of_reviews_ltm 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 1.307246e+00 4.495545e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 9.800000e+01 ▇▁▁▁▁
numeric number_of_reviews_l30d 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 1.308168e-01 5.903010e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.200000e+01 ▇▁▁▁▁
numeric review_scores_rating 5994 0.6749105 NA NA NA NA NA NA NA NA NA NA NA 4.624139e+00 8.416064e-01 0.000000e+00 4.650000e+00 4.860000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_accuracy 6276 0.6596160 NA NA NA NA NA NA NA NA NA NA NA 4.797227e+00 4.507652e-01 0.000000e+00 4.780000e+00 4.940000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_cleanliness 6277 0.6595618 NA NA NA NA NA NA NA NA NA NA NA 4.678515e+00 5.143728e-01 0.000000e+00 4.590000e+00 4.830000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_checkin 6276 0.6596160 NA NA NA NA NA NA NA NA NA NA NA 4.873893e+00 3.856636e-01 0.000000e+00 4.890000e+00 5.000000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_communication 6276 0.6596160 NA NA NA NA NA NA NA NA NA NA NA 4.859365e+00 4.005052e-01 1.000000e+00 4.880000e+00 5.000000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_location 6277 0.6595618 NA NA NA NA NA NA NA NA NA NA NA 4.877528e+00 3.287340e-01 1.000000e+00 4.880000e+00 5.000000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_value 6280 0.6593991 NA NA NA NA NA NA NA NA NA NA NA 4.693264e+00 4.856298e-01 0.000000e+00 4.620000e+00 4.820000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric calculated_host_listings_count 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 7.811259e+00 1.841884e+01 1.000000e+00 1.000000e+00 2.000000e+00 4.000000e+00 1.370000e+02 ▇▁▁▁▁
numeric calculated_host_listings_count_entire_homes 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 7.028691e+00 1.833470e+01 0.000000e+00 1.000000e+00 1.000000e+00 3.000000e+00 1.370000e+02 ▇▁▁▁▁
numeric calculated_host_listings_count_private_rooms 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 5.451242e-01 1.760177e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.100000e+01 ▇▁▁▁▁
numeric calculated_host_listings_count_shared_rooms 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 6.996420e-02 6.805771e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.700000e+01 ▇▁▁▁▁
numeric reviews_per_month 5994 0.6749105 NA NA NA NA NA NA NA NA NA NA NA 6.032682e-01 7.558439e-01 1.000000e-02 1.100000e-01 3.200000e-01 8.100000e-01 8.110000e+00 ▇▁▁▁▁

There are 18,438 observations and 74 variables in the dataframe. Of these, 23 variables are in character format, 5 are in date format, 9 are in logical format, and 37 are in numeric format.
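The counts per storage type can also be computed directly rather than read off the skim table. The following is a minimal sketch on a toy dataframe (the object toy and its four columns are illustrative stand-ins for listings, with values borrowed from the glimpse output above):

```r
# Toy stand-in for `listings`: one column of each storage type
toy <- data.frame(
  name = c("Casa Al Sur", "Amazing Loft"),
  host_since = as.Date(c("2009-04-13", "2009-10-01")),
  host_is_superhost = c(FALSE, TRUE),
  accommodates = c(2, 2),
  stringsAsFactors = FALSE
)

# Take the first class of every column, then tabulate how often each class occurs
col_types <- vapply(toy, function(col) class(col)[1], character(1))
type_counts <- table(col_types)
type_counts
```

Applied to the real listings dataframe, the same two lines should reproduce the character, Date, logical and numeric counts quoted above.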

The following variables are in numeric format:

# finding the columns which are numeric
nums <- unlist(lapply(listings, is.numeric))

#selecting only numeric variables
kbl(colnames(listings[,nums])) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
x
id
scrape_id
host_id
host_listings_count
host_total_listings_count
latitude
longitude
accommodates
bedrooms
beds
minimum_nights
maximum_nights
minimum_minimum_nights
maximum_minimum_nights
minimum_maximum_nights
maximum_maximum_nights
minimum_nights_avg_ntm
maximum_nights_avg_ntm
availability_30
availability_60
availability_90
availability_365
number_of_reviews
number_of_reviews_ltm
number_of_reviews_l30d
review_scores_rating
review_scores_accuracy
review_scores_cleanliness
review_scores_checkin
review_scores_communication
review_scores_location
review_scores_value
calculated_host_listings_count
calculated_host_listings_count_entire_homes
calculated_host_listings_count_private_rooms
calculated_host_listings_count_shared_rooms
reviews_per_month

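The following variables are stored as character strings. Some of them are categorical or factor variables, i.e. variables with a fixed and known set of possible values (for example room_type), while others, such as listing_url or description, are free text.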

# Logical vector flagging which columns are stored as character
fact <- unlist(lapply(listings, is.character))  
# listing only the character columns
kbl(colnames(listings[,fact])) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
x
listing_url
name
description
neighborhood_overview
picture_url
host_url
host_name
host_location
host_about
host_response_time
host_response_rate
host_acceptance_rate
host_thumbnail_url
host_picture_url
host_neighbourhood
host_verifications
neighbourhood
neighbourhood_cleansed
property_type
room_type
bathrooms_text
amenities
price
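Not every character column is categorical in a useful sense: the skim output shows only 4 distinct values for room_type but 17,864 for name. One way to separate the two kinds is to count distinct values per character column. The sketch below does this on a toy dataframe; the object toy, its values and the cut-off max_levels are hypothetical choices for illustration only.

```r
# Toy stand-in for `listings`
toy <- data.frame(
  room_type = c("Entire home/apt", "Entire home/apt", "Private room", "Entire home/apt"),
  name = c("Listing A", "Listing B", "Listing C", "Listing D"),
  accommodates = c(2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Flag character columns, then count the distinct values in each of them
is_chr <- vapply(toy, is.character, logical(1))
n_unique <- vapply(toy[is_chr], function(col) length(unique(col)), integer(1))

# Hypothetical cut-off: columns with few distinct values are candidate factors
max_levels <- 2
categorical_candidates <- names(n_unique)[n_unique <= max_levels]
categorical_candidates
```

On the real data a larger cut-off would be needed, and columns such as price would still have to be excluded by hand because they are numeric values stored as text.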

3.2.2 Correlation between Numeric Variables

# to show the correlation between all the numeric variables in the dataframe
kbl(cor(listings[,nums])) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
id scrape_id host_id host_listings_count host_total_listings_count latitude longitude accommodates bedrooms beds minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm availability_30 availability_60 availability_90 availability_365 number_of_reviews number_of_reviews_ltm number_of_reviews_l30d review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
id 1.0000000 NA 0.4829823 NA NA 0.0144202 -0.0368658 0.0256595 NA NA -0.0326757 -0.0260183 -0.0321714 -0.0349780 0.0426763 0.0167889 -0.0332481 0.0167893 -0.0163400 -0.0099564 -0.0069541 -0.1354264 -0.3550375 0.0552288 0.0648765 NA NA NA NA NA NA NA 0.0638544 0.0683429 -0.0536755 0.0093959 NA
scrape_id NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
host_id 0.4829823 NA 1.0000000 NA NA -0.0908530 0.0046427 -0.0320289 NA NA -0.0424129 -0.0387636 -0.0436154 -0.0456054 -0.0019859 0.0343178 -0.0444780 0.0343177 0.0758486 0.0701194 0.0622490 -0.0677422 -0.1804557 -0.0106043 0.0017619 NA NA NA NA NA NA NA -0.1879010 -0.1863639 -0.0106754 0.0351991 NA
host_listings_count NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
host_total_listings_count NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
latitude 0.0144202 NA -0.0908530 NA NA 1.0000000 -0.5372205 0.0139465 NA NA 0.0419546 0.0037940 0.0423345 0.0417211 0.0167720 0.0005743 0.0425886 0.0005749 -0.1054029 -0.0954205 -0.0911168 -0.0358769 0.0214887 0.0637330 0.0492606 NA NA NA NA NA NA NA 0.0541229 0.0686779 -0.1154082 -0.0462795 NA
longitude -0.0368658 NA 0.0046427 NA NA -0.5372205 1.0000000 0.0524208 NA NA -0.0213042 0.0168961 -0.0228477 -0.0220538 0.0309381 0.0120328 -0.0230126 0.0120334 0.0766677 0.0665711 0.0636355 0.0299735 0.0441809 -0.0274220 -0.0161340 NA NA NA NA NA NA NA 0.0304587 0.0210860 0.0402895 0.0315477 NA
accommodates 0.0256595 NA -0.0320289 NA NA 0.0139465 0.0524208 1.0000000 NA NA -0.0035614 0.0374635 -0.0062106 -0.0062365 0.0691408 0.0019902 -0.0060214 0.0019776 -0.0707786 -0.0636930 -0.0609931 -0.0186892 0.0503732 0.0416317 0.0377335 NA NA NA NA NA NA NA 0.0652512 0.0802937 -0.0921696 -0.0385339 NA
bedrooms NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
beds NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
minimum_nights -0.0326757 NA -0.0424129 NA NA 0.0419546 -0.0213042 -0.0035614 NA NA 1.0000000 0.0049291 0.9840816 0.9831987 -0.0043955 -0.0042997 0.9851330 -0.0042997 0.0116827 0.0060989 0.0030280 0.0416406 -0.0522061 -0.0532184 -0.0466206 NA NA NA NA NA NA NA 0.0184966 0.0218590 -0.0237698 -0.0104447 NA
maximum_nights -0.0260183 NA -0.0387636 NA NA 0.0037940 0.0168961 0.0374635 NA NA 0.0049291 1.0000000 0.0050556 0.0056016 0.4818919 0.0073608 0.0052267 0.0073638 -0.0327755 -0.0348187 -0.0338110 0.0498394 -0.0029126 0.0033501 0.0051649 NA NA NA NA NA NA NA 0.0787046 0.0756952 0.0087460 -0.0149564 NA
minimum_minimum_nights -0.0321714 NA -0.0436154 NA NA 0.0423345 -0.0228477 -0.0062106 NA NA 0.9840816 0.0050556 1.0000000 0.9957239 -0.0042097 -0.0042842 0.9989711 -0.0042842 0.0142509 0.0078667 0.0045909 0.0432534 -0.0539792 -0.0541510 -0.0472828 NA NA NA NA NA NA NA 0.0197345 0.0229857 -0.0229149 -0.0101708 NA
maximum_minimum_nights -0.0349780 NA -0.0456054 NA NA 0.0417211 -0.0220538 -0.0062365 NA NA 0.9831987 0.0056016 0.9957239 1.0000000 -0.0039128 -0.0043710 0.9983256 -0.0043710 0.0094797 0.0044184 0.0016244 0.0423954 -0.0531178 -0.0529026 -0.0461870 NA NA NA NA NA NA NA 0.0198667 0.0228781 -0.0198810 -0.0106594 NA
minimum_maximum_nights 0.0426763 NA -0.0019859 NA NA 0.0167720 0.0309381 0.0691408 NA NA -0.0043955 0.4818919 -0.0042097 -0.0039128 1.0000000 -0.0214643 -0.0039425 -0.0214580 -0.0735273 -0.0719075 -0.0665879 0.0357631 0.0461889 0.0643887 0.0508856 NA NA NA NA NA NA NA 0.0763266 0.0742032 0.0096837 -0.0457397 NA
maximum_maximum_nights 0.0167889 NA 0.0343178 NA NA 0.0005743 0.0120328 0.0019902 NA NA -0.0042997 0.0073608 -0.0042842 -0.0043710 -0.0214643 1.0000000 -0.0043300 0.9999973 -0.0060032 -0.0083219 -0.0101472 0.0026441 -0.0070354 -0.0042825 -0.0032636 NA NA NA NA NA NA NA -0.0030469 -0.0056459 0.0289141 -0.0015151 NA
minimum_nights_avg_ntm -0.0332481 NA -0.0444780 NA NA 0.0425886 -0.0230126 -0.0060214 NA NA 0.9851330 0.0052267 0.9989711 0.9983256 -0.0039425 -0.0043300 1.0000000 -0.0043301 0.0119852 0.0059573 0.0027342 0.0429191 -0.0534382 -0.0532171 -0.0468338 NA NA NA NA NA NA NA 0.0198302 0.0230716 -0.0224429 -0.0104778 NA
maximum_nights_avg_ntm 0.0167893 NA 0.0343177 NA NA 0.0005749 0.0120334 0.0019776 NA NA -0.0042997 0.0073638 -0.0042842 -0.0043710 -0.0214580 0.9999973 -0.0043301 1.0000000 -0.0059794 -0.0082965 -0.0101201 0.0026584 -0.0070350 -0.0042820 -0.0032633 NA NA NA NA NA NA NA -0.0030464 -0.0056455 0.0289141 -0.0015154 NA
availability_30 -0.0163400 NA 0.0758486 NA NA -0.1054029 0.0766677 -0.0707786 NA NA 0.0116827 -0.0327755 0.0142509 0.0094797 -0.0735273 -0.0060032 0.0119852 -0.0059794 1.0000000 0.9566171 0.9106184 0.3732122 -0.1802830 -0.1525458 -0.0922873 NA NA NA NA NA NA NA -0.0527074 -0.0644003 0.0889616 -0.0113171 NA
availability_60 -0.0099564 NA 0.0701194 NA NA -0.0954205 0.0665711 -0.0636930 NA NA 0.0060989 -0.0348187 0.0078667 0.0044184 -0.0719075 -0.0083219 0.0059573 -0.0082965 0.9566171 1.0000000 0.9794879 0.4035407 -0.1658692 -0.1268526 -0.0734154 NA NA NA NA NA NA NA -0.0305657 -0.0397934 0.0771196 -0.0228552 NA
availability_90 -0.0069541 NA 0.0622490 NA NA -0.0911168 0.0636355 -0.0609931 NA NA 0.0030280 -0.0338110 0.0045909 0.0016244 -0.0665879 -0.0101472 0.0027342 -0.0101201 0.9106184 0.9794879 1.0000000 0.4245896 -0.1551103 -0.1153777 -0.0642781 NA NA NA NA NA NA NA -0.0129807 -0.0204452 0.0684765 -0.0316898 NA
availability_365 -0.1354264 NA -0.0677422 NA NA -0.0358769 0.0299735 -0.0186892 NA NA 0.0416406 0.0498394 0.0432534 0.0423954 0.0357631 0.0026441 0.0429191 0.0026584 0.3732122 0.4035407 0.4245896 1.0000000 -0.0533246 -0.1014549 -0.0689872 NA NA NA NA NA NA NA 0.0538917 0.0411271 0.0617789 0.0314637 NA
number_of_reviews -0.3550375 NA -0.1804557 NA NA 0.0214887 0.0441809 0.0503732 NA NA -0.0522061 -0.0029126 -0.0539792 -0.0531178 0.0461889 -0.0070354 -0.0534382 -0.0070350 -0.1802830 -0.1658692 -0.1551103 -0.0533246 1.0000000 0.3348992 0.2387524 NA NA NA NA NA NA NA -0.0554002 -0.0450813 -0.0631179 -0.0409227 NA
number_of_reviews_ltm 0.0552288 NA -0.0106043 NA NA 0.0637330 -0.0274220 0.0416317 NA NA -0.0532184 0.0033501 -0.0541510 -0.0529026 0.0643887 -0.0042825 -0.0532171 -0.0042820 -0.1525458 -0.1268526 -0.1153777 -0.1014549 0.3348992 1.0000000 0.7061156 NA NA NA NA NA NA NA 0.0530862 0.0624504 -0.0651933 -0.0274130 NA
number_of_reviews_l30d 0.0648765 NA 0.0017619 NA NA 0.0492606 -0.0161340 0.0377335 NA NA -0.0466206 0.0051649 -0.0472828 -0.0461870 0.0508856 -0.0032636 -0.0468338 -0.0032633 -0.0922873 -0.0734154 -0.0642781 -0.0689872 0.2387524 0.7061156 1.0000000 NA NA NA NA NA NA NA 0.0271289 0.0345929 -0.0511487 -0.0207579 NA
review_scores_rating NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA
review_scores_accuracy NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA
review_scores_cleanliness NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA
review_scores_checkin NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA
review_scores_communication NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA
review_scores_location NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA
review_scores_value NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
calculated_host_listings_count 0.0638544 NA -0.1879010 NA NA 0.0541229 0.0304587 0.0652512 NA NA 0.0184966 0.0787046 0.0197345 0.0198667 0.0763266 -0.0030469 0.0198302 -0.0030464 -0.0527074 -0.0305657 -0.0129807 0.0538917 -0.0554002 0.0530862 0.0271289 NA NA NA NA NA NA NA 1.0000000 0.9862029 0.0262926 0.0333663 NA
calculated_host_listings_count_entire_homes 0.0683429 NA -0.1863639 NA NA 0.0686779 0.0210860 0.0802937 NA NA 0.0218590 0.0756952 0.0229857 0.0228781 0.0742032 -0.0056459 0.0230716 -0.0056455 -0.0644003 -0.0397934 -0.0204452 0.0411271 -0.0450813 0.0624504 0.0345929 NA NA NA NA NA NA NA 0.9862029 1.0000000 -0.0708269 -0.0379510 NA
calculated_host_listings_count_private_rooms -0.0536755 NA -0.0106754 NA NA -0.1154082 0.0402895 -0.0921696 NA NA -0.0237698 0.0087460 -0.0229149 -0.0198810 0.0096837 0.0289141 -0.0224429 0.0289141 0.0889616 0.0771196 0.0684765 0.0617789 -0.0631179 -0.0651933 -0.0511487 NA NA NA NA NA NA NA 0.0262926 -0.0708269 1.0000000 0.0823490 NA
calculated_host_listings_count_shared_rooms 0.0093959 NA 0.0351991 NA NA -0.0462795 0.0315477 -0.0385339 NA NA -0.0104447 -0.0149564 -0.0101708 -0.0106594 -0.0457397 -0.0015151 -0.0104778 -0.0015154 -0.0113171 -0.0228552 -0.0316898 0.0314637 -0.0409227 -0.0274130 -0.0207579 NA NA NA NA NA NA NA 0.0333663 -0.0379510 0.0823490 1.0000000 NA
reviews_per_month NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1

3.2.3 Scatterplot Matrix of Key Variables

Since the original dataframe has too many variables to plot at once, we hand-selected the variables that our group members think people pay the most attention to when booking an AirBnB and made a scatterplot matrix.

#to plot scatterplot matrix
matrix <- ggpairs(listings[,c("accommodates","bedrooms","minimum_nights","maximum_nights", "number_of_reviews", "review_scores_rating", "reviews_per_month", "review_scores_accuracy" ,"host_identity_verified","instant_bookable", "host_response_time", "room_type")])

matrix

Three pairs of variables are strongly correlated: accommodates and number of bedrooms, number of reviews and reviews per month, and review score accuracy and review score rating.
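To put numbers on these relationships, the correlations can also be computed directly. The sketch below uses a small made-up data frame (`toy`, not the real `listings` data) to show why the review score rows of the correlation matrix above are NA, and how `use = "pairwise.complete.obs"` in `cor()` avoids this by dropping missing values pair by pair.

```r
# Toy data standing in for a few numeric columns of `listings` (values are made up)
toy <- data.frame(
  accommodates         = c(2, 4, 6, 2, 8, 3),
  bedrooms             = c(1, 2, 3, 1, 4, NA),
  review_scores_rating = c(4.8, NA, 4.5, 4.9, 4.2, 4.7)
)

# Default behaviour: a single NA in a column makes its correlations NA
cor_default <- cor(toy)

# Pairwise deletion: each pair uses the rows where both values are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

round(cor_pairwise, 2)
```

With pairwise deletion each correlation may be based on a different set of rows, so it is worth keeping an eye on how many observations each pair actually shares.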

3.3 Data wrangling

3.3.1 Pre-processing Price Data

We noticed that some of the price data (price) is given as a character string, e.g., “$176.00”. Since price is a quantitative variable, we need to make sure it is stored as numeric data num in the dataframe.

listings <- listings %>% 
  
  #to change the price variable into numeric format
  mutate(price = parse_number(price))

#to check that the changes are implemented
typeof(listings$price)
[1] "double"
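To see what this conversion does to individual strings, here is a minimal base-R sketch on made-up prices. The `gsub()` pattern is only an illustrative stand-in for `parse_number()`, which strips the currency symbol and the grouping comma in a similar way.

```r
# Made-up price strings in the same format as the raw data
raw_price <- c("$176.00", "$1,250.00", "$89.50")

# Remove the dollar sign and thousands separator, then convert to numeric
clean_price <- as.numeric(gsub("[$,]", "", raw_price))

clean_price
typeof(clean_price)
```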

3.3.2 Pre-processing Property Types Data

There are too many property types to model individually, so we want to simplify them. The top 4 property types are: Entire rental unit, Private room in rental unit, Entire condominium (condo) and Private room in residential home.

# count the number of listings that fall under one of the 4 top property types
count_top_4_prop_types <- listings %>% 
  filter(property_type %in% c("Entire rental unit", 
                              "Private room in rental unit",
                              "Entire condominium (condo)",
                              "Private room in residential home")) %>% 
  summarize(sum = n())

# count the total number of listing
total_listing_num <- listings %>% 
  summarize(sum = n())

count_top_4_prop_types/total_listing_num

sum
0.867
These four types cover 86.7% of all listings.
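As a quick cross-check of this share, the same ratio can be recomputed from the listing counts per property type shown in the count table later in this section. In the sketch below, the four named counts are taken from that table and `Other` is our own sum of all remaining rows; the vector name `counts` is just for illustration.

```r
# Listing counts per property type (top 4 from the count table, the rest summed)
counts <- c(
  "Entire rental unit"               = 12395,
  "Private room in rental unit"      = 2042,
  "Entire condominium (condo)"       = 784,
  "Private room in residential home" = 771,
  "Other"                            = 2446
)

# Share of listings that fall into one of the top 4 property types
share_top4 <- sum(counts[names(counts) != "Other"]) / sum(counts)
round(share_top4, 3)
```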

Since the vast majority of the observations in the data belong to one of the top four property types, we would like to create a simplified version of the property_type variable named prop_type_simplified that has 5 categories: the top four categories and Other.

listings <- listings %>%
  
  # to create a new variable named prop_type_simplified
  mutate(prop_type_simplified = case_when(
    
    # to keep the top 4 property types as they are
    property_type %in% c("Entire rental unit", "Private room in rental unit","Entire condominium (condo)","Private room in residential home") ~ property_type, 
    # to set property types which are not among the top 4 types to "Other"
    TRUE ~ "Other"
  ))

We then used the code below to confirm that prop_type_simplified was correctly made.

listings %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))        
property_type | prop_type_simplified | n
Entire rental unit | Entire rental unit | 12395
Private room in rental unit | Private room in rental unit | 2042
Entire condominium (condo) | Entire condominium (condo) | 784
Private room in residential home | Private room in residential home | 771
Entire loft | Other | 600
Entire residential home | Other | 337
Entire serviced apartment | Other | 295
Shared room in rental unit | Other | 207
Private room in serviced apartment | Other | 88
Shared room in residential home | Other | 83
Private room in condominium (condo) | Other | 74
Room in hotel | Other | 68
Room in boutique hotel | Other | 65
Private room in bed and breakfast | Other | 63
Room in bed and breakfast | Other | 58
Private room in loft | Other | 38
Room in hostel | Other | 37
Room in serviced apartment | Other | 37
Private room in guest suite | Other | 36
Entire guest suite | Other | 28
Private room in casa particular | Other | 25
Room in aparthotel | Other | 21
Shared room in guesthouse | Other | 21
Entire townhouse | Other | 20
Private room in guesthouse | Other | 20
Shared room in hostel | Other | 18
Private room in hostel | Other | 17
Private room | Other | 16
Entire place | Other | 15
Entire guesthouse | Other | 14
Casa particular | Other | 12
Private room in villa | Other | 12
Private room in townhouse | Other | 11
Shared room in bed and breakfast | Other | 10
Shared room in condominium (condo) | Other | 9
Tiny house | Other | 9
Entire villa | Other | 8
Shared room in loft | Other | 8
Shared room in serviced apartment | Other | 7
Shared room in villa | Other | 7
Camper/RV | Other | 5
Private room in chalet | Other | 4
Private room in tiny house | Other | 4
Entire cabin | Other | 3
Entire home/apt | Other | 3
Shared room | Other | 3
Boat | Other | 2
Earth house | Other | 2
Entire cottage | Other | 2
Private room in cabin | Other | 2
Shared room in boutique hotel | Other | 2
Shared room in guest suite | Other | 2
Shared room in townhouse | Other | 2
Campsite | Other | 1
Car | Other | 1
Cycladic house | Other | 1
Entire bed and breakfast | Other | 1
Entire in-law | Other | 1
Floor | Other | 1
Pension | Other | 1
Private room in boat | Other | 1
Private room in castle | Other | 1
Private room in dome house | Other | 1
Private room in dorm | Other | 1
Private room in farm stay | Other | 1
Private room in floor | Other | 1
Private room in in-law | Other | 1
Private room in resort | Other | 1
Treehouse | Other | 1

3.3.3 Selecting Minimum Nights Data

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels, so we only want to include listings in our regression analysis that are intended for travel purposes.

We first tabulate minimum_nights and then use a density plot to see which minimum stays are most common across the listings.

# count the listings with different minimum nights
min_nights<- table(listings$minimum_nights)
kbl(min_nights) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
Var1 Freq
1 4274
2 3767
3 3741
4 1237
5 1223
6 331
7 1630
8 12
9 18
10 239
11 2
12 20
13 6
14 181
15 338
16 3
17 1
18 3
19 5
20 112
21 27
22 2
24 2
25 29
26 3
27 7
28 168
29 36
30 626
31 14
40 11
45 5
50 3
55 1
58 1
60 80
61 1
65 1
71 1
75 1
79 1
80 3
85 1
89 2
90 144
92 1
100 8
112 1
120 36
130 6
150 2
175 1
179 1
180 34
200 5
240 1
300 4
359 4
360 4
365 15
500 1
730 1
#make a density plot of minimum nights of the listings
ggplot(listings,aes(x=minimum_nights))+
  geom_density()+
  theme_bw()+
   labs (
    title = "Density Plot of the Minimum Nights for all the Listings",
    x = "Minimum Nights",
    y = "Density"
  )+ 
  NULL

The most common values are 1, 2, 3, 7 and 4 nights, in that order.

A minimum of 7 nights stands out because it is more common than 4 nights (1630 listings versus 1237). This may be because many people travel for a full week. Furthermore, many Airbnb hosts offer discounts for 7-night bookings, which can lead to higher sales.

Since we are forecasting the cost of 2 people staying for 4 nights, we only keep the listings with minimum_nights <= 4.

listings_min4 <- listings %>%
  #filtering for data set with minimum nights less than or equal to 4
  filter(minimum_nights <= 4)

4 Mapping

We plot the AirBnB listings with a minimum stay of at most 4 nights on a map of Buenos Aires.

# Creating a map with blue points referring to each AirBnB location in Buenos Aires
leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

5 Regression Analysis

We will use the cost for two people to stay at an Airbnb location for four (4) nights as our target variable \(Y\).

We created a new variable called price_4_nights. We keep only the listings that accommodate at least two guests and allow stays of at least four nights, and multiply the nightly price by 4 to obtain the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.

pricefor2 <- listings_min4 %>%
  #keeping listings that accommodate at least 2 guests and allow stays of at least 4 nights
  filter(accommodates >= 2, maximum_nights >= 4) %>% 
  
  #create the price_4_nights variable
  mutate(price_4_nights = (price*4))

We realized that there may be extremely high or low values in price_4_nights that could seriously affect our regression analysis, so we used histograms to examine the distributions of price_4_nights and log(price_4_nights) and decide which variable to use in the regression model.

#plot the histogram for `price_4_nights`
ggplot(pricefor2, aes(x = price_4_nights))+
  geom_histogram()+
  NULL

#plot the histogram for `log(price_4_nights)`
ggplot(pricefor2, aes(x = log(price_4_nights)))+
  geom_histogram()+
NULL

Using log(price_4_nights) is the better choice for the regression analysis because the log transformation reduces the effect of extreme values, so we create a new variable as follows:

logpricefor2 <- pricefor2 %>% 
  # creating a new variable for the logarithm of price_4_nights
  mutate(logprice4nights = log(price_4_nights))

5.1 Model 1

For our initial model, we will be looking at the effect of property types, number of reviews and review score rating on log(price_4_nights)

#Creating a logarithmic regression model
model1 <- lm(logprice4nights ~
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating,
             data= logpricefor2)
#displaying regression estimates
summary(model1)

Call:
lm(formula = logprice4nights ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating, data = logpricefor2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0701 -0.4468 -0.0944  0.3431  5.6667 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           9.6212151  0.0581534
prop_type_simplifiedEntire rental unit               -0.0845437  0.0359128
prop_type_simplifiedOther                             0.0554728  0.0410768
prop_type_simplifiedPrivate room in rental unit      -0.6193214  0.0473098
prop_type_simplifiedPrivate room in residential home -0.7164308  0.0558697
number_of_reviews                                    -0.0011363  0.0001735
review_scores_rating                                 -0.0013568  0.0098422
                                                     t value Pr(>|t|)    
(Intercept)                                          165.445  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.354   0.0186 *  
prop_type_simplifiedOther                              1.350   0.1769    
prop_type_simplifiedPrivate room in rental unit      -13.091  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -12.823  < 2e-16 ***
number_of_reviews                                     -6.548 6.18e-11 ***
review_scores_rating                                  -0.138   0.8904    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6627 on 8413 degrees of freedom
  (3127 observations deleted due to missingness)
Multiple R-squared:  0.0623,    Adjusted R-squared:  0.06163 
F-statistic: 93.16 on 6 and 8413 DF,  p-value: < 2.2e-16

Interpretation of coefficients (percentage effects are computed as 100 * (exp(coefficient) - 1), holding the other variables constant; the baseline property type is entire condominium):

  • A 1 unit increase in review_scores_rating is associated with a decrease in price_4_nights of 0.14%, although this coefficient is not statistically significant (p = 0.89)

  • Each additional review is associated with a decrease in price_4_nights of 0.11%

  • If the property type is entire rental unit, the price_4_nights is 8.11% lower than for an entire condominium

  • If the property type is other, the price_4_nights is 5.70% higher than for an entire condominium, although this coefficient is not statistically significant (p = 0.18)

  • If the property type is private room in rental unit, the price_4_nights is 46.17% lower than for an entire condominium

  • If the property type is private room in residential home, the price_4_nights is 51.15% lower than for an entire condominium
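Because the dependent variable is a logarithm, each coefficient has to be converted with 100 * (exp(b) - 1) before it can be read as a percentage change in price_4_nights. The sketch below applies this conversion to the model1 estimates copied from the summary output above; the vector name `coefs` is only for illustration.

```r
# Coefficient estimates copied from summary(model1)
coefs <- c(
  "Entire rental unit"               = -0.0845437,
  "Other"                            =  0.0554728,
  "Private room in rental unit"      = -0.6193214,
  "Private room in residential home" = -0.7164308,
  "number_of_reviews"                = -0.0011363,
  "review_scores_rating"             = -0.0013568
)

# Percentage change in price_4_nights for a one-unit change in each predictor
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 2)
```

For small coefficients such as number_of_reviews, the percentage change is almost identical to 100 times the coefficient, but for the room types the difference is substantial.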

After creating the first model, we wanted to determine if room_type is a significant predictor of the price for 4 nights. We decided to create a model with all variables in model1 and room_type

#Creating a new logarithmic regression model with additional variables
model2 <- lm(logprice4nights ~
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating +
               room_type,
             data= logpricefor2)
#displaying regression estimates
summary(model2)

Call:
lm(formula = logprice4nights ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type, data = logpricefor2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2188 -0.4403 -0.0949  0.3396  5.6679 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           9.6878663  0.0571688
prop_type_simplifiedEntire rental unit               -0.0862458  0.0352109
prop_type_simplifiedOther                             0.2311889  0.0424067
prop_type_simplifiedPrivate room in rental unit      -0.0419747  0.0743405
prop_type_simplifiedPrivate room in residential home -0.1384833  0.0798644
number_of_reviews                                    -0.0012826  0.0001704
review_scores_rating                                 -0.0143446  0.0096839
room_typeHotel room                                   0.0743577  0.0863216
room_typePrivate room                                -0.5824204  0.0582948
room_typeShared room                                 -1.4370858  0.0878270
                                                     t value Pr(>|t|)    
(Intercept)                                          169.461  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.449   0.0143 *  
prop_type_simplifiedOther                              5.452 5.13e-08 ***
prop_type_simplifiedPrivate room in rental unit       -0.565   0.5723    
prop_type_simplifiedPrivate room in residential home  -1.734   0.0830 .  
number_of_reviews                                     -7.527 5.74e-14 ***
review_scores_rating                                  -1.481   0.1386    
room_typeHotel room                                    0.861   0.3890    
room_typePrivate room                                 -9.991  < 2e-16 ***
room_typeShared room                                 -16.363  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6498 on 8410 degrees of freedom
  (3127 observations deleted due to missingness)
Multiple R-squared:  0.09892,   Adjusted R-squared:  0.09796 
F-statistic: 102.6 on 9 and 8410 DF,  p-value: < 2.2e-16
#Running an F test on model 2 and model 1
anova(model2,model1)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F)
8.41e+03 | 3.55e+03 |  |  |  | 
8.41e+03 | 3.7e+03 | -3 | -144 | 114 | 2.54e-72

We ran an F-test comparing model 2 with model 1. The test (F = 114 on 3 degrees of freedom, p = 2.54e-72) shows that room_type is a significant predictor of the cost for 4 nights.
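To make the F-test less of a black box, the statistic can be approximately reproduced by hand from the two summary outputs above. The sketch below recovers each residual sum of squares from the residual standard error and degrees of freedom, so the result only matches the anova() output up to rounding; all variable names are our own.

```r
# Values read from the summary() output of model1 and model2
sigma1 <- 0.6627; df1 <- 8413   # model1: residual standard error, residual df
sigma2 <- 0.6498; df2 <- 8410   # model2: residual standard error, residual df

# Residual sum of squares = (residual standard error)^2 * residual df
rss1 <- sigma1^2 * df1
rss2 <- sigma2^2 * df2

# Number of extra coefficients in model2 (the three room_type dummies)
q <- df1 - df2

# F statistic for the nested comparison and its p-value
f_stat  <- ((rss1 - rss2) / q) / (rss2 / df2)
p_value <- pf(f_stat, q, df2, lower.tail = FALSE)

round(c(RSS1 = rss1, RSS2 = rss2, F = f_stat), 1)
```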

5.2 Further variables/questions to explore on our own

Our dataset contained many more variables. We decided to explore further variables to determine whether they are significant predictors of price_4_nights.

We started with the number of bedrooms, the number of beds, and the size of the property (accommodates).

#Exploring further variables in regression
model3 <- lm(logprice4nights ~
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating +
               bedrooms +
               beds +
               accommodates ,
             data = logpricefor2)
#displaying regression estimates
summary(model3)

Call:
lm(formula = logprice4nights ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + bedrooms + beds + accommodates, data = logpricefor2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7322 -0.3868 -0.0510  0.3335  5.8590 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           8.9399683  0.0589011
prop_type_simplifiedEntire rental unit               -0.0968416  0.0361957
prop_type_simplifiedOther                            -0.0755567  0.0410870
prop_type_simplifiedPrivate room in rental unit      -0.5066357  0.0460417
prop_type_simplifiedPrivate room in residential home -0.7395637  0.0532870
number_of_reviews                                    -0.0008880  0.0001763
review_scores_rating                                  0.0098442  0.0094550
bedrooms                                              0.2801877  0.0147588
beds                                                 -0.0617795  0.0072554
accommodates                                          0.1348535  0.0088166
                                                     t value Pr(>|t|)    
(Intercept)                                          151.779  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.676  0.00748 ** 
prop_type_simplifiedOther                             -1.839  0.06597 .  
prop_type_simplifiedPrivate room in rental unit      -11.004  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -13.879  < 2e-16 ***
number_of_reviews                                     -5.036 4.86e-07 ***
review_scores_rating                                   1.041  0.29784    
bedrooms                                              18.984  < 2e-16 ***
beds                                                  -8.515  < 2e-16 ***
accommodates                                          15.295  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5943 on 6914 degrees of freedom
  (4623 observations deleted due to missingness)
Multiple R-squared:  0.2831,    Adjusted R-squared:  0.2821 
F-statistic: 303.3 on 9 and 6914 DF,  p-value: < 2.2e-16

We then converted the character bathrooms_text variable into a numeric variable and added it to model 3 to determine if it is a significant predictor.
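One detail to keep in mind for this conversion is that bathroom descriptions can contain decimals such as "1.5 baths". The sketch below is a hypothetical base-R helper (`extract_bathrooms()` is our own name, not part of any package) that keeps the decimal part; treating text with "half" but no digits as 0.5 is purely an assumption for illustration.

```r
# Hypothetical helper: pull the first number (with optional decimals) out of the text
extract_bathrooms <- function(txt) {
  m <- regexpr("[0-9]+(\\.[0-9]+)?", txt)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_real_, length(txt))
  out[hit] <- as.numeric(regmatches(txt, m))
  # Assumption: descriptions such as "Half-bath" without digits count as 0.5
  out[is.na(out) & grepl("half", txt, ignore.case = TRUE)] <- 0.5
  out
}

# Made-up examples of bathroom descriptions
example_text <- c("1 bath", "1.5 baths", "2 shared baths", "Half-bath", NA)
extract_bathrooms(example_text)
```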

#Creating a table with frequency of each bathroom type
bathroom_freq <- table(logpricefor2$bathrooms_text)
View(bathroom_freq)

#Extracting numeric values from bathroom text
logpricefor2_bathroom <- logpricefor2 %>% 
  #extracting the first run of digits from bathroom text
  #(the pattern stops at the decimal point, so "1.5 baths" is recorded as 1)
  mutate(bathrooms_numeric = str_extract(bathrooms_text,"[[:digit:]]+")) %>% 
  #converting the extracted digits from character to numeric
  mutate(bathrooms_numeric = parse_number(bathrooms_numeric))

#Adding bathroom numbers to the regression model
model4 <- lm(logprice4nights ~
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating +
               bedrooms +
               beds +
               accommodates +
               bathrooms_numeric ,
             data= logpricefor2_bathroom)

#Displaying Regression estimates
summary(model4)

Call:
lm(formula = logprice4nights ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + bedrooms + beds + accommodates + bathrooms_numeric, 
    data = logpricefor2_bathroom)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1492 -0.3844 -0.0471  0.3321  5.8724 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           8.8997067  0.0591348
prop_type_simplifiedEntire rental unit               -0.0948874  0.0361022
prop_type_simplifiedOther                            -0.0942160  0.0411347
prop_type_simplifiedPrivate room in rental unit      -0.5265953  0.0461533
prop_type_simplifiedPrivate room in residential home -0.7678637  0.0533739
number_of_reviews                                    -0.0008593  0.0001760
review_scores_rating                                  0.0115643  0.0094352
bedrooms                                              0.2537917  0.0154107
beds                                                 -0.0634780  0.0072426
accommodates                                          0.1257527  0.0089453
bathrooms_numeric                                     0.0846850  0.0146073
                                                     t value Pr(>|t|)    
(Intercept)                                          150.499  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.628   0.0086 ** 
prop_type_simplifiedOther                             -2.290   0.0220 *  
prop_type_simplifiedPrivate room in rental unit      -11.410  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -14.387  < 2e-16 ***
number_of_reviews                                     -4.883 1.07e-06 ***
review_scores_rating                                   1.226   0.2204    
bedrooms                                              16.469  < 2e-16 ***
beds                                                  -8.765  < 2e-16 ***
accommodates                                          14.058  < 2e-16 ***
bathrooms_numeric                                      5.797 7.03e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5928 on 6897 degrees of freedom
  (4639 observations deleted due to missingness)
Multiple R-squared:  0.2867,    Adjusted R-squared:  0.2856 
F-statistic: 277.2 on 10 and 6897 DF,  p-value: < 2.2e-16
#Running a Variance Inflation Factor on the model created above
car::vif(model4)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.124220  4        1.014744
number_of_reviews    1.017573  1        1.008748
review_scores_rating 1.016955  1        1.008442
bedrooms             2.615285  1        1.617184
beds                 2.746143  1        1.657149
accommodates         3.963503  1        1.990855
bathrooms_numeric    1.737532  1        1.318155

After fitting the model, we computed the Variance Inflation Factors (VIF) to check whether the model suffers from multicollinearity and did not find any such problem: the largest value is 3.96 (accommodates), which is below 5. We did notice that review score rating has consistently had a p-value greater than 0.05. We believe this may be because number of reviews and review score rating are both present in the same model, as people may be more likely to leave reviews if they have negative opinions of the property.
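For readers unfamiliar with the VIF, it can be computed by hand: regress each predictor on all the other predictors and take 1 / (1 - R^2). The sketch below does this on simulated data (x1, x2 and x3 are made up, with x2 deliberately built from x1), so the numbers are unrelated to our listings.

```r
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # strongly related to x1 by construction
x3 <- rnorm(n)                  # unrelated to the others
predictors <- data.frame(x1, x2, x3)

# VIF of each predictor = 1 / (1 - R^2) from regressing it on the other predictors
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

round(vif_by_hand, 2)
```

The two related predictors get a clearly inflated VIF while x3 stays close to 1. For terms with one degree of freedom, the GVIF reported by car::vif() is the same quantity.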

We then decided to explore the effects of the host being a superhost, whether the property is instantly bookable, and how many days the property is available in the next 30 days (availability_30). We ran two models with these variables to decide whether we should keep number of reviews or review score rating.

#Adding new predictors to model and removing review score rating

model5 <- lm(logprice4nights ~
               prop_type_simplified +
               number_of_reviews +
               bedrooms +
               beds +
               accommodates +
               bathrooms_numeric +
               host_is_superhost +
               instant_bookable +
               availability_30 ,
             data= logpricefor2_bathroom)


#displaying regression estimates
summary(model5)

Call:
lm(formula = logprice4nights ~ prop_type_simplified + number_of_reviews + 
    bedrooms + beds + accommodates + bathrooms_numeric + host_is_superhost + 
    instant_bookable + availability_30, data = logpricefor2_bathroom)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3084 -0.4040 -0.0674  0.3356  5.9906 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           8.9572488  0.0360313
prop_type_simplifiedEntire rental unit               -0.0937053  0.0307103
prop_type_simplifiedOther                            -0.1153607  0.0349925
prop_type_simplifiedPrivate room in rental unit      -0.6044183  0.0392059
prop_type_simplifiedPrivate room in residential home -0.7592464  0.0461346
number_of_reviews                                    -0.0009400  0.0001912
bedrooms                                              0.1227879  0.0093417
beds                                                 -0.0400995  0.0048689
accommodates                                          0.1302920  0.0067425
bathrooms_numeric                                     0.0963275  0.0125347
host_is_superhostTRUE                                -0.0015915  0.0162011
instant_bookableTRUE                                 -0.0331453  0.0136200
availability_30                                       0.0080069  0.0005698
                                                     t value Pr(>|t|)    
(Intercept)                                          248.597  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -3.051 0.002285 ** 
prop_type_simplifiedOther                             -3.297 0.000982 ***
prop_type_simplifiedPrivate room in rental unit      -15.417  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -16.457  < 2e-16 ***
number_of_reviews                                     -4.917 8.95e-07 ***
bedrooms                                              13.144  < 2e-16 ***
beds                                                  -8.236  < 2e-16 ***
accommodates                                          19.324  < 2e-16 ***
bathrooms_numeric                                      7.685 1.68e-14 ***
host_is_superhostTRUE                                 -0.098 0.921746    
instant_bookableTRUE                                  -2.434 0.014969 *  
availability_30                                       14.052  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6535 on 9479 degrees of freedom
  (2055 observations deleted due to missingness)
Multiple R-squared:  0.2368,    Adjusted R-squared:  0.2359 
F-statistic: 245.1 on 12 and 9479 DF,  p-value: < 2.2e-16
#Adding new predictors to model and removing number of reviews
model6 <- lm(logprice4nights ~
               prop_type_simplified +
               review_scores_rating +
               bedrooms +
               beds +
               accommodates +
               bathrooms_numeric +
               instant_bookable +
               availability_30 ,
             data= logpricefor2_bathroom)

#displaying regression estimates
summary(model6)

Call:
lm(formula = logprice4nights ~ prop_type_simplified + review_scores_rating + 
    bedrooms + beds + accommodates + bathrooms_numeric + instant_bookable + 
    availability_30, data = logpricefor2_bathroom)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1817 -0.3686 -0.0535  0.3203  6.0677 

Coefficients:
                                                      Estimate Std. Error
(Intercept)                                           8.712996   0.060944
prop_type_simplifiedEntire rental unit               -0.095423   0.035616
prop_type_simplifiedOther                            -0.117613   0.040611
prop_type_simplifiedPrivate room in rental unit      -0.585803   0.045733
prop_type_simplifiedPrivate room in residential home -0.805900   0.052730
review_scores_rating                                  0.021160   0.009338
bedrooms                                              0.254587   0.015208
beds                                                 -0.062975   0.007145
accommodates                                          0.125169   0.008815
bathrooms_numeric                                     0.080815   0.014420
instant_bookableTRUE                                 -0.039415   0.014449
availability_30                                       0.008223   0.000573
                                                     t value Pr(>|t|)    
(Intercept)                                          142.967  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.679  0.00740 ** 
prop_type_simplifiedOther                             -2.896  0.00379 ** 
prop_type_simplifiedPrivate room in rental unit      -12.809  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -15.284  < 2e-16 ***
review_scores_rating                                   2.266  0.02348 *  
bedrooms                                              16.740  < 2e-16 ***
beds                                                  -8.813  < 2e-16 ***
accommodates                                          14.200  < 2e-16 ***
bathrooms_numeric                                      5.604 2.17e-08 ***
instant_bookableTRUE                                  -2.728  0.00639 ** 
availability_30                                       14.351  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5848 on 6896 degrees of freedom
  (4639 observations deleted due to missingness)
Multiple R-squared:  0.3058,    Adjusted R-squared:  0.3047 
F-statistic: 276.2 on 11 and 6896 DF,  p-value: < 2.2e-16
#Creating a table with coefficients and statistical data of each regression model we created
huxreg(model1, model2, model3, model4, model5, model6)
 | (1) | (2) | (3) | (4) | (5) | (6)
(Intercept) | 9.621 *** | 9.688 *** | 8.940 *** | 8.900 *** | 8.957 *** | 8.713 ***
 | (0.058) | (0.057) | (0.059) | (0.059) | (0.036) | (0.061)
prop_type_simplifiedEntire rental unit | -0.085 * | -0.086 * | -0.097 ** | -0.095 ** | -0.094 ** | -0.095 **
 | (0.036) | (0.035) | (0.036) | (0.036) | (0.031) | (0.036)
prop_type_simplifiedOther | 0.055 | 0.231 *** | -0.076 | -0.094 * | -0.115 *** | -0.118 **
 | (0.041) | (0.042) | (0.041) | (0.041) | (0.035) | (0.041)
prop_type_simplifiedPrivate room in rental unit | -0.619 *** | -0.042 | -0.507 *** | -0.527 *** | -0.604 *** | -0.586 ***
 | (0.047) | (0.074) | (0.046) | (0.046) | (0.039) | (0.046)
prop_type_simplifiedPrivate room in residential home | -0.716 *** | -0.138 | -0.740 *** | -0.768 *** | -0.759 *** | -0.806 ***
 | (0.056) | (0.080) | (0.053) | (0.053) | (0.046) | (0.053)
number_of_reviews | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | 
 | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | 
review_scores_rating | -0.001 | -0.014 | 0.010 | 0.012 |  | 0.021 *
 | (0.010) | (0.010) | (0.009) | (0.009) |  | (0.009)
room_typeHotel room |  | 0.074 |  |  |  | 
 |  | (0.086) |  |  |  | 
room_typePrivate room |  | -0.582 *** |  |  |  | 
 |  | (0.058) |  |  |  | 
room_typeShared room |  | -1.437 *** |  |  |  | 
 |  | (0.088) |  |  |  | 
bedrooms |  |  | 0.280 *** | 0.254 *** | 0.123 *** | 0.255 ***
 |  |  | (0.015) | (0.015) | (0.009) | (0.015)
beds |  |  | -0.062 *** | -0.063 *** | -0.040 *** | -0.063 ***
 |  |  | (0.007) | (0.007) | (0.005) | (0.007)
accommodates |  |  | 0.135 *** | 0.126 *** | 0.130 *** | 0.125 ***
 |  |  | (0.009) | (0.009) | (0.007) | (0.009)
bathrooms_numeric |  |  |  | 0.085 *** | 0.096 *** | 0.081 ***
 |  |  |  | (0.015) | (0.013) | (0.014)
host_is_superhostTRUE |  |  |  |  | -0.002 | 
 |  |  |  |  | (0.016) | 
instant_bookableTRUE |  |  |  |  | -0.033 * | -0.039 **
 |  |  |  |  | (0.014) | (0.014)
availability_30 |  |  |  |  | 0.008 *** | 0.008 ***
 |  |  |  |  | (0.001) | (0.001)
N | 8420 | 8420 | 6924 | 6908 | 9492 | 6908
R2 | 0.062 | 0.099 | 0.283 | 0.287 | 0.237 | 0.306
logLik | -8480.079 | -8312.351 | -6217.026 | -6183.779 | -9424.350 | -6089.866
AIC | 16976.159 | 16646.702 | 12456.052 | 12391.558 | 18876.700 | 12205.731
*** p < 0.001; ** p < 0.01; * p < 0.05.

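We ultimately decided to use model6, as guests are more likely to be influenced by the review score rating than by the number of reviews. Furthermore, model6 has the highest \(R^2\) (0.306) and the lowest AIC among the 6 models, although the models are fitted on different numbers of observations because of missing values, so these statistics are not strictly comparable. The final model contains 8 predictors (11 estimated coefficients besides the intercept), all of which have a p-value less than 0.05.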
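Because lm() drops rows with missing values, the six models are fitted on different samples (N ranges from 6908 to 9492 in the table above). A minimal sketch of how two models could be compared on a common sample is shown below; it uses simulated data, and the names `toy`, `fit_a` and `fit_b` are made up for illustration.

```r
set.seed(1)
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.5 * toy$x1 + 0.3 * toy$x2 + rnorm(n)
toy$x2[sample(n, 40)] <- NA   # x2 is missing for 40 rows

# Fitted separately, lm() silently uses a different number of rows for each model
fit_a <- lm(y ~ x1, data = toy)
fit_b <- lm(y ~ x1 + x2, data = toy)
c(nobs(fit_a), nobs(fit_b))

# Restrict both models to rows that are complete for every variable used
common <- toy[complete.cases(toy[, c("y", "x1", "x2")]), ]
fit_a_common <- lm(y ~ x1, data = common)
fit_b_common <- lm(y ~ x1 + x2, data = common)

# AIC is only comparable when both models use the same observations
AIC(fit_a_common, fit_b_common)
```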

5.3 Diagnostics, collinearity, summary tables

To check whether the residuals of our model are normally distributed, we used a Q-Q plot.

#Creating a Q-Q plot
autoplot(model6)[2]

According to the plot, the residuals in the middle of the distribution follow the normal line closely, but both ends show fat tails.

We ran the Variance Inflation Factor again to ensure the final model does not suffer from collinear variables.

#Running a Variance Inflation Factor on our final model
car::vif(model6)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.155113  4        1.018188
review_scores_rating 1.023347  1        1.011606
bedrooms             2.616805  1        1.617654
beds                 2.746113  1        1.657140
accommodates         3.954256  1        1.988531
bathrooms_numeric    1.739646  1        1.318956
instant_bookable     1.008725  1        1.004353
availability_30      1.049392  1        1.024398

After running VIF, we concluded that the final model does not suffer from collinear variables, as the VIF is less than 5 for all variables.
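To make the VIF figures less of a black box, here is a minimal sketch of the calculation for numeric predictors: each predictor is regressed on the others, and VIF = 1 / (1 - R^2) of that auxiliary regression. It assumes the built-in mtcars data rather than the Airbnb data, and vif_by_hand is a hypothetical helper, not part of car. For terms with one degree of freedom, the GVIF^(1/(2*Df)) column in the output above is simply the square root of the VIF.

```r
# Minimal sketch on the built-in mtcars data (not the Airbnb data)
fit_demo <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors
vif_by_hand <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vif_demo <- vif_by_hand(fit_demo)
vif_demo

# Square root of the VIF, comparable to GVIF^(1/(2*Df)) when Df = 1
sqrt(vif_demo)
```

This helper only covers predictors that occupy a single column of the design matrix. A factor such as prop_type_simplified spans several columns, which is why car::vif() reports a generalised VIF for it.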

5.4 Predicting the cost of staying at Buenos Aires for 4 nights

#Filtering the dataset for our criteria for staying in Buenos Aires
#(the result is not called "filter", to avoid masking dplyr::filter)
listings_filtered <- logpricefor2_bathroom %>% 
  filter(prop_type_simplified %in% c("Private room in rental unit", "Private room in residential home"),
         number_of_reviews >= 10,
         review_scores_rating >= 4.5)

#Finding the predicted values for staying in Buenos Aires
model_predict <- predict(model6, newdata = listings_filtered, interval = "prediction")

#Finding the median of cost to stay
Cost_to_stay <- exp(median(model_predict[,1],
                           #ensuring na values are not used for median 
                           na.rm = TRUE))
Cost_to_stay
[1] 7243.908
#Creating a function that will create the confidence interval for us
confidence_interval <- function(vector, interval) {
  # Standard deviation of sample
  vec_sd <- sd(vector)
  # Sample size
  n <- length(vector)
  # Mean of sample
  vec_mean <- mean(vector)
  # Error according to t distribution
  error <- qt((interval + 1)/2, df = n - 1) * vec_sd / sqrt(n)
  # Confidence interval as a vector
  result <- c("lower" = vec_mean - error, "upper" = vec_mean + error)
  return(result)
}

#Finding the 95% confidence interval of log of price for 4 nights
#Note: model_predict[1,] selects the first row, i.e. fit, lwr and upr for the first listing
CI_dependent <- confidence_interval(model_predict[1,], .95)

#Finding the actual value by using the exponential function
CI_lower_dependent <- exp(CI_dependent[1])
CI_upper_dependent <- exp(CI_dependent[2])

CI_upper_dependent
   upper 
940615.6 
CI_lower_dependent
   lower 
3071.019 

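We estimate that the median predicted price to stay 4 nights in a private room in Buenos Aires is 7,244 Argentinian Pesos. The 95% confidence interval computed above runs from 3,071 to 940,616 Argentinian Pesos. The interval is this wide because confidence_interval() was applied to model_predict[1,], that is, to the three values (fit, lwr, upr) of the first listing only, rather than to the fitted values of all listings.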
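A minimal sketch of how an interval could instead be built from the fitted values of all listings is given below. It uses simulated data, and every name in it (demo, fit_demo, pred, fitted_log) is hypothetical. The key step is selecting the fit column of the prediction matrix instead of a single row. No figures for the Airbnb data are implied.

```r
# Minimal sketch with simulated data; all names here are hypothetical
set.seed(42)
n <- 200
demo <- data.frame(bedrooms = sample(1:4, n, replace = TRUE))
demo$log_price <- 8 + 0.25 * demo$bedrooms + rnorm(n, sd = 0.5)
fit_demo <- lm(log_price ~ bedrooms, data = demo)

# predict() returns a matrix with one row per listing and columns fit, lwr, upr
pred <- predict(fit_demo, newdata = demo, interval = "prediction")

# Fitted log prices for all listings: select the column, not a row
fitted_log <- pred[, "fit"]
fitted_log <- fitted_log[!is.na(fitted_log)]  # drop listings with missing predictors

# t-based 95% confidence interval for the mean fitted log price
m  <- mean(fitted_log)
se <- sd(fitted_log) / sqrt(length(fitted_log))
ci_log <- m + c(lower = -1, upper = 1) * qt(0.975, df = length(fitted_log) - 1) * se

# Back-transform to the price scale
ci_price <- exp(ci_log)
median_price <- exp(median(fitted_log))
ci_price
median_price
```

This interval describes the average fitted log price across the selected listings. The lwr and upr columns of pred answer a different question: the plausible range for the price of one individual listing.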

6 Findings and Recommendations

Our multiple linear regression model has an intercept of 8.712. Since our base case is a condominium, this is the expected value of the natural logarithm of price_4_nights for a condominium when all the other explanatory variables are held at 0. Taking the exponential of this figure gives 6075.38 Argentinian Pesos, equivalent to £44.78. All of the property type subcategories have negative coefficients, so any deviation from the base case property type (condominium) is associated with a decrease in the price for 4 nights. The largest effect is seen in properties classified as a private room in a residential home, with a coefficient of -0.806. This makes intuitive sense, as the reduced privacy of a room in a residential home could make it less attractive to travellers, and hosts could compensate for this by charging a lower price.

The review score rating has a positive relationship with the price for 4 nights: a 1 unit increase in review score rating is associated with roughly a 2.11% increase in the price for 4 nights. Because the dependent variable is the logarithm of price, each coefficient is a change in log price, which for small values is approximately a percentage change in the price itself. A positive relationship is also observed with the number of bedrooms. The coefficient is 0.255, meaning that an extra bedroom is associated with an increase of 0.255 in the log price, or roughly 25.5% in price (exp(0.255) - 1, about 29%, when converted exactly). This is an expected result, as prices are likely to increase with the size of the property and its sleeping capacity.

Interestingly, the coefficient for number of beds is negative at -0.063: holding the other variables constant, an extra bed is associated with a decrease of roughly 6.3% in the price for 4 nights. A positive relationship might be expected, as seen for the number of bedrooms. However, the negative relationship might be explained by listings that fit many small single beds into the same space, which may be less desirable for travellers who are willing to pay higher prices.

As might be expected, the number of people the property can accommodate is positively related to price. With a coefficient of 0.125, the capacity to accommodate an extra person is associated with roughly a 12.5% increase in the price for 4 nights. An extra bathroom is associated with roughly an 8.08% increase in the price for 4 nights.

It might seem surprising that properties with an instant book feature are negatively related to price (the coefficient is -0.039, or roughly 3.9% lower), since this feature would seem to be useful for customers as it provides convenience. It is likely that the negative relationship exists because properties classed as more luxurious may be more selective about who they allow to stay and therefore do not offer instant booking, and these properties are also likely to command a higher price.

The availability of the property over the next 30 days has a weak positive relationship with the price for 4 nights. With a coefficient of 0.008, each additional unit of availability_30 is associated with roughly a 0.8% higher price.
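Because the dependent variable is log(price_4_nights), a coefficient b corresponds to an exact percentage change in price of 100 * (exp(b) - 1) for a one-unit increase in the variable, and 100 * b is a close approximation only when b is small. The sketch below applies both formulas to the model6 coefficients reported in the regression table. The vector coefs is typed in by hand for illustration and is not extracted from the fitted model.

```r
# Coefficients of model6 as reported in the regression table (typed in by hand)
coefs <- c(bedrooms = 0.255, beds = -0.063, accommodates = 0.125,
           bathrooms_numeric = 0.081, instant_bookableTRUE = -0.039,
           availability_30 = 0.008)

# Exact percentage change in price for a one-unit increase: 100 * (exp(b) - 1)
pct_exact <- 100 * (exp(coefs) - 1)

# Approximation: 100 * b, close to the exact value when |b| is small
pct_approx <- 100 * coefs

# Side-by-side comparison, rounded to two decimals
pct_table <- round(cbind(approx = pct_approx, exact = pct_exact), 2)
pct_table
```

The gap between the two columns grows with the size of the coefficient, so it matters most for bedrooms and for the property type coefficients such as -0.806.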

7 Acknowledgements